data wrangling [English]


InterPARES Definition

n. ~ The process of preparing an irregular or incomplete dataset by analyzing its nature, restructuring it, normalizing its values, enhancing it with additional data, validating the data, and making it available for use.
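
By way of illustration, the "normalizing values" step of this definition usually means reconciling variant encodings of the same fact. A minimal sketch in pandas (the column, the variant spellings, and the canonical form are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "U.S.A.", "United States", "us"]})

# Map variant spellings of the same value to one canonical form.
canonical = {"usa": "US", "u.s.a.": "US", "united states": "US", "us": "US"}
df["country"] = df["country"].str.lower().map(canonical).fillna(df["country"])
print(df["country"].unique())  # -> ['US']
```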

Citations

  • DataWatch 2017 (†869 ): The process of cleaning and unifying messy and complex data sets for easy access and analysis. ¶ With the amount of data and the number of data sources rapidly growing, it is increasingly essential that the large amounts of available data be organized for analysis. ¶ This process typically includes manually converting/mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. (†2612)
  • DataWatch 2017 (†869 ): The goals of data wrangling: · Reveal a “deeper intelligence” within your data by gathering data from multiple sources · Put accurate, actionable data in the hands of business analysts in a timely manner · Reduce the time spent collecting and organizing unruly data before it can be utilized · Enable data scientists and analysts to focus on the analysis of data, rather than the wrangling · Drive better decision-making by senior leaders in an organization ¶ The key steps to data wrangling: · Data Acquisition: Identify and obtain access to the data within your sources · Joining Data: Combine the edited data for further use and analysis · Data Cleansing: Redesign the data into a usable/functional format and correct/remove any bad data (†2613)
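
DataWatch’s three key steps can be sketched briefly in pandas; the source names no tools, so the library choice, the column names, and the inline sample data are assumptions for illustration:

```python
import pandas as pd

# Data Acquisition: obtain data from your sources (inline stand-ins here;
# in practice these might come from pd.read_csv, an API, or a database).
orders = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                       "amount": [120.0, None, 80.0, 45.5]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["north", "SOUTH ", "south"]})

# Joining Data: combine the sources for further use and analysis.
combined = orders.merge(customers, on="customer_id", how="left")

# Data Cleansing: redesign into a usable form and correct/remove bad data.
combined["region"] = combined["region"].str.strip().str.lower()
combined = combined.dropna(subset=["amount"])

print(combined)
```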
  • Lorican 2016 (†745 ): The most interactive tasks that people do with data are essentially data wrangling. You’re changing the form of the data, you’re changing the content of the data, and at the same time you’re trying to evaluate the quality of the data and see if you’re making it the way you want it. … It’s really actually the most immersive interaction that people do with data and it’s very interesting. (†1852)
  • Rattenbury 2015 (†872 ): Data wrangling [is] a process that includes six core activities. . . . 1. Discovering is something of an umbrella term for the entire process; in it, you learn what is in your data and what might be the best approach for productive analytic explorations. 2. Structuring is needed because data comes in all shapes and sizes. 3. Cleaning involves taking out data that might distort the analysis. 4. Enriching allows you to take advantage of the wrangling you have already done to ask yourself: “Now that I have a sense of my data, what other data might be useful in this analysis?” Or, “What new kinds of data can I derive from the data I already have?” 5. Validating is the activity that surfaces data quality and consistency issues, or verifies that they have been properly addressed by applied transformations. 6. Publishing refers to planning for and delivering the output of your data wrangling efforts for downstream project needs (like loading the data in a particular analysis package) or for future project needs (like documenting and archiving transformation logic). (†2617)
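
Rattenbury’s six activities map onto a short, self-contained pandas workflow. This is a sketch under assumed sample data; the outlier cutoff, the derived column, and the output file are hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({"q1": [10, 12, 900], "q2": [11, 13, 14]},
                   index=["store_a", "store_b", "store_c"])

# 1. Discovering: learn what is in the data.
print(raw.describe())

# 2. Structuring: reshape wide quarterly columns into long form.
tidy = raw.reset_index().rename(columns={"index": "store"}).melt(
    id_vars="store", var_name="quarter", value_name="sales")

# 3. Cleaning: take out data that might distort the analysis
#    (here, an implausible outlier; the cutoff is an assumption).
tidy = tidy[tidy["sales"] < 500]

# 4. Enriching: derive new data from the data already there.
tidy["sales_share"] = tidy["sales"] / tidy["sales"].sum()

# 5. Validating: surface quality issues, or verify they were addressed.
assert tidy["sales"].notna().all()
assert (tidy["sales"] >= 0).all()

# 6. Publishing: deliver the output for downstream project needs.
tidy.to_csv("sales_tidy.csv", index=False)
```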
  • Rattenbury et al. 2017 (†871 p. ix): The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data. . . . 50 to 80 percent of an analyst’s time is spent wrangling data to get it to the point at which this kind of analysis is possible. Not only does data wrangling consume most of an analyst’s workday, it also represents much of the analyst’s professional process: it captures activities like understanding what data is available; choosing what data to use and at what level of detail; understanding how to meaningfully combine multiple sources of data; and deciding how to distill the results to a size and shape that can drive downstream analysis. . . . and in the context of agile analytics, these activities also capture the creative and scientific intuition of the analyst, which can dictate different decisions for each use case and data source. (†2616)
  • Tomar 2016 (†870 ): It is often the case with data science projects that you’ll have to deal with messy or incomplete data. The raw data we obtain from different data sources is often unusable at the beginning. All the activity that you do on the raw data to make it “clean” enough to input to your analytical algorithm is called data wrangling or data munging. If you want to create an efficient ETL pipeline (extract, transform and load) or create beautiful data visualizations, you should be prepared to do a lot of data wrangling. (†2614)
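
A compact sketch of the ETL shape Tomar mentions, using pandas with Python’s built-in sqlite3 as the load target; the table name, the columns, and the cleaning rules are assumptions:

```python
import sqlite3
import pandas as pd

# Extract: pull raw records (an inline stand-in for a real source).
raw = pd.DataFrame({"id": [1, 2, 3],
                    "signup": ["2016-01-05", "not a date", "2016-02-10"],
                    "plan": [" Pro", "basic ", "PRO"]})

# Transform: the wrangling/munging step that makes raw data usable.
clean = raw.assign(
    signup=pd.to_datetime(raw["signup"], errors="coerce"),
    plan=raw["plan"].str.strip().str.lower(),
).dropna(subset=["signup"])

# Load: write the cleaned data to its destination.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("signups", conn, if_exists="replace", index=False)
```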
  • Tomar 2016 (†870 ): Data wrangling is an important part of any data analysis. You’ll want to make sure your data is in tip-top shape and ready for convenient consumption before you apply any algorithms to it. Data preparation is a key part of a great data analysis. By dropping null values, filtering and selecting the right data, and working with time series, you can ensure that any machine learning or treatment you apply to your cleaned-up data is fully effective. (†2615)
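
The operations Tomar names (dropping null values, filtering and selecting the right data, working with time series) in one brief pandas chain; the sensor data and the daily resampling frequency are invented for illustration:

```python
import pandas as pd

readings = pd.DataFrame({
    "ts": pd.to_datetime(["2016-03-01 00:00", "2016-03-01 12:00",
                          "2016-03-02 00:00", "2016-03-02 12:00"]),
    "sensor": ["a", "a", "b", "a"],
    "value": [1.0, None, 3.5, 2.0],
})

cleaned = (
    readings
    .dropna(subset=["value"])             # drop null values
    .loc[lambda df: df["sensor"] == "a"]  # filter/select the right data
    .set_index("ts")                      # index by timestamp
    .resample("D")["value"].mean()        # work with the time series
)
print(cleaned)
```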